Character Sequence Modeling for Transliteration
Abstract
Character Sequence Modeling (CSM), typically called language modeling, has not received sufficient attention in current transliteration research. We discuss the impact of several CSM factors (word granularity, smoothing technique, corpus variation, and word origin) on transliteration accuracy. We demonstrate the importance of CSM by showing that, for transliteration into English from two very different languages, Hindi and Persian, systems that use only monolingual resources and simple non-probabilistic character mappings achieve accuracy close to that of baseline statistical systems trained on parallel transliteration pairs. This suggests that a reasonable transliteration system can be built for resource-scarce languages that lack large parallel corpora.
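The idea in the abstract, combining a non-probabilistic character mapping with a character sequence model trained only on target-language text, can be illustrated with a minimal sketch. This is not the authors' system: the bigram order, the add-k smoothing, the length normalization, and every name below (`train_char_bigram`, `log_prob`, `candidates`, `transliterate`) are assumptions chosen for illustration, and the mapping and word list in the usage section are hypothetical toy data.

```python
import math
from collections import Counter
from itertools import product

BOS, EOS = "^", "$"  # word-boundary markers (assumed not to occur in the data)


def train_char_bigram(words):
    """Collect character bigram statistics from a monolingual word list."""
    context_counts, bigram_counts, vocab = Counter(), Counter(), {EOS}
    for word in words:
        chars = [BOS] + list(word.lower()) + [EOS]
        vocab.update(chars[1:])
        for prev, cur in zip(chars, chars[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts, vocab


def log_prob(word, model, k=0.5):
    """Log-probability of a word under the bigram model with add-k smoothing."""
    context_counts, bigram_counts, vocab = model
    chars = [BOS] + list(word.lower()) + [EOS]
    vocab_size = len(vocab)
    total = 0.0
    for prev, cur in zip(chars, chars[1:]):
        numerator = bigram_counts[(prev, cur)] + k
        denominator = context_counts[prev] + k * vocab_size
        total += math.log(numerator / denominator)
    return total


def candidates(source, mapping):
    """Expand each source symbol into its allowed target strings.

    The mapping carries no probabilities; unmapped symbols pass through.
    """
    options = [mapping.get(symbol, [symbol]) for symbol in source]
    return ["".join(parts) for parts in product(*options)]


def transliterate(source, mapping, model):
    """Pick the candidate the character model scores highest per character."""
    return max(
        candidates(source, mapping),
        key=lambda cand: log_prob(cand, model) / (len(cand) + 1),
    )


if __name__ == "__main__":
    # Hypothetical toy data: a tiny target-side word list and a mapping from
    # placeholder source symbols to possible target spellings.
    model = train_char_bigram(["kitab", "kabir", "tabir"])
    mapping = {"K": ["ki", "ku"], "T": ["ta", "to"], "B": ["b", "p"]}
    print(transliterate("KTB", mapping, model))
```

The number of candidates grows exponentially with the source length, so a practical system would replace the exhaustive `product` expansion with a beam or Viterbi search. The smoothing method and model order are the kind of CSM factors the abstract says affect accuracy.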
Similar resources
A Bayesian model of bilingual segmentation for transliteration
In this paper we propose a novel Bayesian model for unsupervised bilingual character sequence segmentation of corpora for transliteration. The system is based on a Dirichlet process model trained using Bayesian inference through blocked Gibbs sampling implemented using an efficient forward filtering/backward sampling dynamic programming algorithm. The Bayesian approach is able to overcome the o...
A Syllable-based Name Transliteration System
This paper describes the name entity transliteration system which we conducted for the “NEWS2009 Machine Transliteration Shared Task” (Li et al 2009). We get the transliteration in Chinese from an English name with three steps. We syllabify the English name into a sequence of syllables by some rules, and generate the most probable Pinyin sequence with the mapping model of English syllables to P...
The Application of Bayesian Alignment Techniques to Transliteration Generation and Mining
Bayesian techniques have recently been applied to many areas of natural language processing, and have proven themselves particularly useful in areas involving segmentation and alignment. This paper looks at the direct application of these techniques to the co-segmentation/alignment of grapheme sequences. We detail a novel Bayesian model for unsupervised bilingual character sequence alignment of...
Applying Neural Networks to English-Chinese Named Entity Transliteration
This paper presents the machine transliteration systems that we employ for our participation in the NEWS 2016 machine transliteration shared task. Based on the prevalent deep learning models developed for general sequence processing tasks, we use convolutional neural networks to extract character level information from the transliteration units and stack a simple recurrent neural network on top...
DirecTL: a Language Independent Approach to Transliteration
We present DIRECTL: an online discriminative sequence prediction model that employs a many-to-many alignment between target and source. Our system incorporates input segmentation, target character prediction, and sequence modeling in a unified dynamic programming framework. Experimental results suggest that DIRECTL is able to independently discover many of the language-specific regularities in ...